INTRODUCTION


This  report  summarizes  the  literature  in  the  areas  of  pitch  and  time-scale  modification 
of  speech.  The  report  primarily  focuses  on  the  pitch  modification  of  speech.  However,  since 
some  pitch  modification  techniques  simultaneously  modify  the  time  scale  of  a  speech  signal 
(and  vice  versa),  the  report  also  briefly  covers  the  time-scale  modification  literature.  Based 
on  the  claims  made  in  the  literature,  this  report  recommends  four  techniques  for  further 
consideration: the pitch-synchronous overlap-add technique; the sinusoidal analysis/synthesis
system of Quatieri and McAulay; the harmonic plus noise model of Laroche, Stylianou, and
Moulines; and the time-domain harmonic scaling technique. All four techniques are capable of
modifying both the pitch and the time scale of speech signals.

The problems of pitch modification and time-scale modification are perhaps best discussed
in the context of a simple source/filter model of speech signals. In such a model, the speech
signal is considered to be the output of a linear time-varying filter excited by a time-varying
source. The time-varying source consists of two major components in changing proportions:
a quasiperiodic pulse train and a noise-like portion. When the source consists mostly of the
pulse train component, the output speech is quasiperiodic and is considered to be “voiced.”
When the source consists mostly of the noise-like component, the output speech is noise-like
and is considered to be “unvoiced.”

In  general,  a  segment  of  speech  has  mixed  voicing;  in  other  words,  the  source  consists  of 
both  a  quasiperiodic  pulse  train  portion  and  a  noise-like  portion.  For  segments  with  mixed 
voicing,  the  filter  applied  to  the  noise-like  portion  of  the  source  may  be  different  from  the 
filter  applied  to  the  quasiperiodic  pulse  train.  Thus,  a  speech  signal,  s(t),  can  be  written  as 

$$s(t) = \int_{-\infty}^{\infty} h_t^v(\tau)\, u_v(t - \tau)\, d\tau + \int_{-\infty}^{\infty} h_t^u(\tau)\, u_u(t - \tau)\, d\tau, \qquad (1)$$

where

•  $u_v(t)$ is the voiced portion of the source (i.e., the quasiperiodic pulse train),

•  $h_t^v(\tau)$ is the linear time-varying filter that acts on the voiced portion of the source,

•  $u_u(t)$ is the unvoiced portion of the source (i.e., the noise-like portion of the source),
and

•  $h_t^u(\tau)$ is the linear time-varying filter that acts on the unvoiced portion of the source.
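As a rough numerical illustration of Equation 1, the following Python sketch treats both filters as fixed over one short segment, so that each integral reduces to an ordinary discrete convolution. This time-invariance is a simplifying assumption, and the function name, sample rate, and filter taps are hypothetical values chosen only for illustration.

```python
import numpy as np

def source_filter_segment(u_v, u_u, h_v, h_u):
    """Discrete-time version of Equation 1 for one short segment.

    Assumes the filters do not change within the segment, so each
    integral becomes a plain convolution (truncated to the input length).
    """
    n = len(u_v)
    voiced = np.convolve(u_v, h_v)[:n]      # filter acting on the pulse train
    unvoiced = np.convolve(u_u, h_u)[:n]    # filter acting on the noise
    return voiced + unvoiced

fs = 8000                                   # assumed sample rate in Hz
n = 800
u_v = np.zeros(n)
u_v[::80] = 1.0                             # 100 Hz impulse train (80-sample period)
u_u = 0.05 * np.random.default_rng(0).standard_normal(n)
h_v = 0.9 ** np.arange(40)                  # illustrative decaying impulse response
h_u = np.array([1.0, -0.5])                 # illustrative short FIR response
s = source_filter_segment(u_v, u_u, h_v, h_u)
```

Changing the relative levels of `u_v` and `u_u` moves the segment between mostly voiced and mostly unvoiced, which is the mixed-voicing situation the model is meant to cover.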

The pitch modification problem can be viewed in both a narrow sense and a wide sense.
In the narrow sense, pitch modification consists of modifying the time-varying fundamental
frequency of $u_v(t)$ while making no other changes to $u_v(t)$. In the wider sense, pitch
modification consists of modifying the time-varying fundamental frequency while also
judiciously modifying the time-varying spectral envelope of $u_v(t)$.

In general, the speech model of Equation 1 is not unique. The spectral shape of $s(t)$ is
partially due to the sources and partially due to the filters, and the contributions of either
the sources or the filters to the overall spectral shape of the signal can be modified as long
as other factors are adjusted to compensate for these modifications. In particular, one can
modify the spectral shape of $u_v(t)$ and still obtain the same $s(t)$ provided that one modifies
the spectral shape of $h_t^v(\tau)$ accordingly. In the literature, various shapes have been
tried for $u_v(t)$, ranging from a simple impulse train as used in standard linear predictive
coding to more complex pulse trains representing the glottal flow. An example of the latter
is shown in Figure 1(a).

When $u_v(t)$ is a simple impulse train, pitch modification consists of simply changing
the spacings of the impulses. However, when the pulse shapes in $u_v(t)$ are more complex,
modifications of the pulse shape (and hence of the spectral envelope of $u_v(t)$) may also
be necessary. For example, Figure 1(b) shows that moving pulses closer together without
modifying the shape of the pulses can eliminate the closed phase of the glottal flow. (The
closed phase in each period of the waveform is the portion with zero magnitude.) On the
other hand, Figure 1(c) shows a number of pulses of the same fundamental frequency as those
in Figure 1(b), but with the pulse shape compressed. In this case, a closed-phase portion
exists for each period of the waveform. The point of this example is that the wider view
of pitch modification, as consisting of both modifications to the time-varying fundamental
frequency and modifications to the time-varying spectral envelope of $u_v(t)$, has certain
merits. This report considers the pitch modification problem in the wider sense.

The various pitch modification techniques in the literature fall into two main categories:
frequency-scaling techniques and frequency-resampling techniques. Frequency-scaling
techniques modify both the fundamental frequency and the spectral envelope of $u_v(t)$, while
frequency-resampling techniques attempt to modify only the fundamental frequency of $u_v(t)$.
The following definitions serve to delineate the two categories.


Frequency Resampling: The process of changing the fundamental frequency
of a quasiperiodic segment of a signal without modifying the segment’s time duration
or its spectral envelope.

Frequency Scaling: The process of compressing or expanding the spectrum
of a signal without modifying the time duration of the signal. In other words,
frequency-scale modification is the process of compressing or expanding the short-time
Fourier transform (STFT) of a speech signal only along the frequency axis.
Figure 1: Source Waveforms Showing (a) Plot of the Glottal Flow for a Voiced Phoneme,
(b) Pitch Modification of the Original Glottal Flow Waveform Accomplished by Overlapping
and Adding the Individual Pulses, and (c) Pitch Modification of the Original Glottal Flow
Waveform Accomplished by Compressing the Pulses in Time
Figure 2: Spectra of (a) an Unmodified Signal, (b) a Signal Frequency-Scaled by a Factor of
$\beta$, and (c) a Signal After Application of Frequency Resampling by a Factor of $\beta$


Note  that  it  is  not  clear  how  closely  the  spectral  envelope  modifications  of  frequency  scaling 
resemble  those  required  to  properly  transform  uv(t). 

Figure 2 illustrates the difference between frequency resampling and frequency scaling.
Figure 2(a) shows the spectrum of an unmodified periodic signal. The spectral envelope of
the signal contains two peaks with a frequency spacing of $F_e$. The fundamental frequency
is $F_0$. Scaling the frequency axis by a factor of $\beta$ results in the modified spectrum shown
in Figure 2(b). Here, the spacing between the two peaks of the spectral envelope is $\beta F_e$,
and the modified fundamental frequency is $\beta F_0$. Thus, frequency scaling modifies both
the harmonic structure and the spectral envelope of the signal. Resampling the frequency
axis by a factor of $\beta$ results in the modified spectrum shown in Figure 2(c). The modified
fundamental frequency is again $\beta F_0$, but the spectral envelope is the same as that of the
original signal. In particular, the spacing between the two peaks in the spectral envelope is
the original spacing of $F_e$.
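The distinction between the two operations can be made concrete with a small Python sketch that works directly on harmonic line spectra. The two-peak envelope (peaks at 500 Hz and 1500 Hz, so that the peak spacing is 1000 Hz) and all function names are hypothetical and serve only to mimic the situation described for Figure 2.

```python
import numpy as np

def envelope(f):
    """Hypothetical two-peak spectral envelope (peaks at 500 Hz and 1500 Hz)."""
    return np.exp(-((f - 500.0) / 200.0) ** 2) + np.exp(-((f - 1500.0) / 200.0) ** 2)

def harmonic_lines(f0, n_harm, beta=1.0, mode="none"):
    """Return (frequencies, amplitudes) of the harmonic lines.

    mode="scale":    the whole spectrum is stretched, so each line keeps its amplitude.
    mode="resample": the lines move, but amplitudes are read from the original envelope.
    """
    k = np.arange(1, n_harm + 1)
    if mode == "scale":
        return beta * k * f0, envelope(k * f0)
    if mode == "resample":
        return beta * k * f0, envelope(beta * k * f0)
    return k * f0, envelope(k * f0)

f0, beta = 100.0, 1.2
f_orig, a_orig = harmonic_lines(f0, 20)
f_scal, a_scal = harmonic_lines(f0, 20, beta, "scale")
f_res, a_res = harmonic_lines(f0, 20, beta, "resample")
```

In both modified spectra the lines sit at multiples of 120 Hz. Under scaling the strongest low-frequency line moves from 500 Hz to 600 Hz along with the envelope, whereas under resampling the strongest line is the one nearest the unchanged envelope peak.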

For  completeness,  a  definition  of  the  time-scale  modification  process  is  as  follows. 


Time-Scale  Modification:  The  process  of  compressing  or  expanding  the  time 
duration  of  a  signal  without  modifying  the  apparent  local  frequency  content  of 
the  signal.  In  other  words,  time-scale  modification  is  the  process  of  compressing 
or  expanding  the  STFT  of  a  speech  signal  only  along  the  time  axis. 
MAJOR PITCH MODIFICATION TECHNIQUES


This  section  presents  the  major  pitch  modification  techniques:  techniques  based  on  simple 
linear  prediction  models  for  speech;  the  multiband  excitation  vocoder;  the  time-domain 
harmonic  scaling  method  and  linear  prediction;  the  sinusoidal  analysis/synthesis  system  of 
Quatieri  and  McAulay;  the  harmonic  plus  noise  model  of  Laroche,  Stylianou,  and  Moulines; 
and  the  pitch-synchronous  overlap-add  method. 


Basic  Linear  Predictive  Coding  Models  of  Speech 


Basic linear predictive coding (LPC) models of speech provide some of the most
straightforward ways of modifying the pitch of a speech signal [1-5]. Figure 3 shows an LPC-based
pitch modification method. First, the system performs LPC analysis on short segments of
the original signal, resulting in a set of filter coefficients for each segment and a residual
signal. (Filtering the residual signal with the time-varying filter coefficients returns the
original signal.) Second, the system forms a synthetic excitation signal, where the excitation
signal consists of white noise for the unvoiced segments and periodic impulse trains for the
voiced segments. The spacing between the impulses, $T_0$, varies according to the desired
pitch, $F_0$, as $T_0 = 1/F_0$. Third, the system forms the pitch-modified signal by filtering the
synthetic excitation signal with the time-varying linear filter formed by the LPC coefficients.
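The three steps can be sketched in Python for a single voiced frame. This is a minimal illustration under stated assumptions rather than any of the cited systems: the LPC coefficients are found with the autocorrelation method, the frame is a toy one-pole impulse response instead of speech, and all function names are hypothetical.

```python
import numpy as np

def lpc_coefficients(frame, order):
    """Predictor coefficients via the autocorrelation method (normal equations)."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:len(frame) + order]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1:order + 1])

def impulse_train(n_samples, period):
    """Unit impulses spaced `period` samples apart (period = fs / desired F0)."""
    e = np.zeros(n_samples)
    e[::period] = 1.0
    return e

def all_pole_filter(excitation, a):
    """Run the excitation through 1 / (1 - sum_k a[k-1] z^-k)."""
    y = np.zeros(len(excitation))
    for n in range(len(excitation)):
        acc = excitation[n]
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                acc += ak * y[n - k]
        y[n] = acc
    return y

fs = 8000
frame = 0.9 ** np.arange(200)          # toy frame: impulse response of a one-pole filter
a = lpc_coefficients(frame, order=1)   # step 1: analysis
desired_f0 = 125.0
period = int(round(fs / desired_f0))   # step 2: T0 = 1 / F0, expressed in samples
resynth = all_pole_filter(impulse_train(400, period), a)   # step 3: synthesis
```

For unvoiced frames the impulse train would be replaced by white noise; that branch is omitted here to keep the sketch short.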

Although this method is conceptually simple and easy to implement, the output speech
has a distinct synthetic quality due to the simplified excitation model [1,3-5]. This fact
has prompted researchers to develop excitation models that produce more natural-sounding
speech. In [3], Milenkovic models the voiced-speech excitation as shown in Figure 4. Pitch
modification using this excitation could be accomplished by varying the spacing between
the pulses. In multipulse LPC models [6-8], the excitation consists of a small number
of impulses (generally 8 to 10) over each 10 msec frame. In [9], Caspers and Atal
investigated two different methods for modifying the pitch of multipulse LPC speech. The
first method modified the length of individual pitch periods by linearly scaling the time axis
of the multipulse excitation. The second method modified the length of individual pitch
periods by adding or subtracting zeros in the excitation. Both methods modified the length
of the excitation, so pitch periods were added to or removed from the excitation in order to
obtain an excitation of the same length as the original excitation. Caspers and Atal found
that the second pitch modification method (i.e., the addition or subtraction of zeros in the
excitation) introduced very little distortion, while the linear scaling of the time axis produced
significantly more distortion.
Figure 3: Pitch Modification Based on Linear Predictive Coding Analysis of the Speech Signal


Figure  4:  Sketch  of  the  Voiced-Speech  Excitation  Signal  Used  by  Milenkovic  (1993) 
The Multiband Excitation Vocoder


As noted in [1,2,10], voiced-speech segments generally contain both harmonic and inharmonic
components, and modeling voiced-speech segments as purely harmonic signals leads
to  synthetic  speech  with  a  buzzy  quality.  For  these  reasons,  Griffin  and  Lim  developed  the 
multiband  excitation  (MBE)  vocoder  [1,2,10].  The  MBE  vocoder  divides  the  spectrum  of 
the  excitation  into  a  number  of  adjacent  frequency  bands  with  each  band  labeled  as  voiced 
or  unvoiced.  Thus,  the  voiced-speech  excitation  model  for  the  MBE  vocoder  contains  both 
voiced  and  unvoiced  components. 

Figure  5  shows  a  diagram  of  a  pitch  modification  system  based  on  the  MBE  vocoder. 
Over  small  speech  segments,  the  system  estimates  the  spectral  envelope  and  the  pitch  of 
the speech and makes voiced/unvoiced decisions about the speech. The system divides the
spectrum  of  the  speech  into  several  adjacent  frequency  bands,  and  makes  a  voiced/unvoiced 
decision  for  each  band.  These  voiced/unvoiced  decisions  yield  a  voicing  indicator  function 
in  the  frequency  domain,  where  the  value  is  one  for  voiced  bands  and  zero  for  unvoiced 
bands.  Using  the  desired  pitch,  the  system  creates  the  voiced  portion  of  the  excitation  by 
including  those  harmonics  of  the  desired  fundamental  frequency  that  fall  within  the  voiced- 
speech  regions  as  indicated  by  the  voicing  indicator  function.  The  system  forms  the  unvoiced 
portion  of  the  excitation  by  multiplying  the  spectrum  of  a  sample  white  noise  sequence  with 
a function indicating the unvoiced frequency bands (i.e., a function that has a value of one
for  unvoiced  frequency  bands  and  a  value  of  zero  for  voiced  frequency  bands).  Finally,  the 
system  forms  the  output  speech  signal  by  scaling  the  excitation  spectrum  (the  sum  of  the 
voiced  and  unvoiced  portions)  by  the  estimated  spectral  envelope. 
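The construction of the mixed excitation spectrum can be sketched in Python as follows. The uniform frequency grid, the four 1 kHz bands, the voicing decisions, and the function name are all hypothetical; the sketch also assumes that the grid starts at 0 Hz and that the desired harmonics fall on grid points.

```python
import numpy as np

def mbe_excitation_spectrum(freqs, band_edges, voiced_flags, f0, noise_spectrum):
    """Build an MBE-style excitation spectrum on the uniform grid `freqs`.

    voiced_flags[b] is True when band b (band_edges[b] <= f < band_edges[b+1]) is voiced.
    """
    band = np.searchsorted(band_edges, freqs, side="right") - 1
    band = np.clip(band, 0, len(voiced_flags) - 1)
    voicing = np.asarray(voiced_flags, dtype=float)[band]    # 1 = voiced, 0 = unvoiced

    harmonic = np.zeros_like(freqs, dtype=complex)
    df = freqs[1] - freqs[0]
    for k in range(1, int(freqs[-1] // f0) + 1):
        harmonic[int(round(k * f0 / df))] = 1.0               # line at k * f0

    # Harmonics survive only in voiced bands, noise only in unvoiced bands.
    return voicing * harmonic + (1.0 - voicing) * noise_spectrum

freqs = np.arange(0.0, 4001.0, 10.0)                          # 0 to 4000 Hz in 10 Hz steps
band_edges = np.array([0.0, 1000.0, 2000.0, 3000.0, 4000.0])
rng = np.random.default_rng(0)
noise = rng.standard_normal(len(freqs)) + 1j * rng.standard_normal(len(freqs))
excitation = mbe_excitation_spectrum(freqs, band_edges, [True, True, False, False], 150.0, noise)
```

Multiplying `excitation` by an estimated spectral envelope sampled on the same grid would then give the output spectrum of the final step.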


Sampling-Rate  Conversion  and  LPC 


To  improve  the  quality  of  speech  from  LPC-based  systems,  one  can  use  the  LPC  residual 
signal as the excitation for the time-varying filter given by the LPC coefficients. One generally
does  not  use  the  original  residual  as  the  excitation  signal  in  coding  applications,  because  the 
residual  requires  several  bits  to  code.  However,  reducing  the  bit  rate  is  of  little  concern  in  the 
pitch  modification  problem,  so  one  can  use  the  residual  as  the  excitation  to  greatly  improve 
the  quality  of  the  output  speech. 

Conversion  of  the  sampling  rate  of  the  LPC  residual  of  a  speech  signal  is  one  way  to  change 
the  pitch  of  the  signal.  Figure  6  shows  a  diagram  of  a  pitch  modification  system  based  on 
modifying the LPC residual. Suppose that the LPC residual is originally sampled at a rate
of $F_s$ samples/second, and suppose that a system changes the sampling rate to $F_N$. If the
system outputs points from the resampled residual at the new rate of $F_N$ samples/second,
then the output residual is a lowpass-filtered version of the original residual. However, if
the system outputs points from the converted residual at the original sampling rate of $F_s$
samples/second, then the residual is scaled in frequency by the factor $\beta = F_s / F_N$. Note that
this procedure changes the duration of the signal, so a time-scale modification technique must
be used in conjunction with this technique. Methods for sampling-rate conversion can be
found in [11-19].

Figure 5: A Pitch Modification System Based on the Multiband Excitation Vocoder

Figure 6: Diagram of a Pitch Modification System Based on Modifying the LPC Residual
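The effect of converting the sampling rate and then playing the samples back at the original rate can be checked with a short Python sketch. Linear interpolation stands in for a proper sampling-rate converter (the cited methods use better filters), a pure sinusoid stands in for a voiced residual, and the function names are hypothetical.

```python
import numpy as np

def resample_linear(x, fs_old, fs_new):
    """Crude sampling-rate conversion by linear interpolation (no anti-alias filter)."""
    n_new = int(round(len(x) * fs_new / fs_old))
    t_new = np.arange(n_new) * (fs_old / fs_new)     # new sample positions, in old-sample units
    return np.interp(t_new, np.arange(len(x)), x)

def dominant_frequency(x, fs):
    """Frequency of the largest FFT magnitude bin."""
    spectrum = np.abs(np.fft.rfft(x * np.hanning(len(x))))
    return np.argmax(spectrum) * fs / len(x)

fs, fn = 8000.0, 10000.0
t = np.arange(8000) / fs
residual = np.sin(2 * np.pi * 200.0 * t)             # stand-in for a voiced residual
converted = resample_linear(residual, fs, fn)
beta = fs / fn                                        # frequency scale factor
f_out = dominant_frequency(converted, fs)             # interpret samples at the ORIGINAL rate
```

Here the 200 Hz component is scaled by the ratio of the original to the new sampling rate, and the converted signal is 1.25 times longer than the input, which is why a time-scale modification step is needed afterwards.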

There  are  two  main  drawbacks  to  using  sampling-rate  conversion  as  a  means  for  pitch 
modification.  The  first  drawback  is  that  sampling-rate  conversion  scales  the  frequency 
response  of  the  unvoiced  portions  of  speech  as  well  as  the  voiced  portions.  The  second 
drawback is that fine time-varying pitch modification is difficult to perform using this
technique.


Time-Domain  Harmonic  Scaling  and  LPC 


Another  pitch  modification  method  based  on  modifying  the  LPC  residual  is  the  technique 
of  time-domain  harmonic  scaling  (TDHS)  [20-27].  TDHS  is  a  general  technique  for  frequency 
scaling  a  signal  (not  just  the  LPC  residual).  Techniques  similar  to  TDHS  have  been  developed 
in  [28,26]. 

The  TDHS-based  pitch  modification  system  has  the  same  functional  form  as  shown  in 
Figure  6.  The  processing  for  TDHS  depends  on  whether  one  wants  to  expand  or  compress 
the frequency spectrum. The steps required to scale the spectrum by a factor of $\beta$ are as
follows:
•  Find two relatively prime integers $\mu$ and $\delta$ such that $\beta \approx \mu / \delta$. Let $C = \delta / \mu$. For frequency
compression, $\mu < \delta$, which implies that $C > 1$. For frequency expansion, $\delta < \mu$, which
implies that $C < 1$.

•  Find the pitch period, $T_0$, and calculate an integer, $N_p$, such that $T_0 \approx N_p T$, where $T$
is the sampling period. $N_p$ is the approximate number of signal samples in one pitch
period.

•  Calculate  the  following  integer  values:  N  =  yNp  and  Nc  ~  ,  ■ 

|G— 1| 

•  Calculate  Tc  =  . 

•  Define $a_c(n) = \lfloor n / N_c \rfloor$, where $n$ is some integer and $\lfloor \cdot \rfloor$ denotes the floor
operator.

•  Define $h_N(\cdot) = N h(\cdot)$, where $h(\cdot)$ is a sampled version of an analog prototype lowpass
filter (usually chosen to be a triangular window).

•  Calculate

$$y_c(nCT) = \begin{cases} \displaystyle\sum_{i=0}^{m-1} x\left(nT + a_c(n) N_p T - i N_p T\right) h_N\left(i N_c T_c + (n \bmod N_c)\, T_c\right), & \beta < 1, \\[2ex] \displaystyle\sum_{i=1}^{m} x\left(nT - a_c(n) N_p T - i N_p T\right) h_N\left(i N_c T_c - (n \bmod N_c)\, T_c\right), & \beta > 1, \end{cases}$$

where $y_c(\cdot)$ is the frequency-scaled residual, $x(\cdot)$ is the original residual, and $m$ is a small
integer (generally, $m = 2$). For $\beta < 1$, $N_c$ output samples are computed for every $N_c + N_p$
input samples. For $\beta > 1$, $N_c$ output samples are computed for every $N_c - N_p$ input
samples.
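The integer bookkeeping in the first steps above can be sketched in Python. This covers only the parameter setup, not the TDHS filtering itself. The relation $N_c \approx N_p / |C - 1|$ used below is an inference from the statement that $N_c$ output samples correspond to $N_c \pm N_p$ input samples, the bound on the denominator is an arbitrary choice, and the function name is hypothetical.

```python
from fractions import Fraction

def tdhs_parameters(beta, pitch_period_s, fs, max_den=16):
    """Integer quantities used in the TDHS steps (parameter setup only).

    Assumes beta != 1; max_den limits the size of the integers mu and delta.
    """
    ratio = Fraction(beta).limit_denominator(max_den)   # beta ~= mu / delta
    mu, delta = ratio.numerator, ratio.denominator      # relatively prime by construction
    c = Fraction(delta, mu)                              # C = delta / mu
    n_p = int(round(pitch_period_s * fs))                # T0 ~= Np * T
    n_c = int(round(n_p / abs(c - 1)))                   # inferred: Nc ~= Np / |C - 1|
    return mu, delta, c, n_p, n_c

mu, delta, c, n_p, n_c = tdhs_parameters(0.8, 0.01, 8000)
```

With these example values the block sizes are consistent with the compression case: $N_c$ output samples spaced $CT$ apart span the same time as $N_c + N_p$ input samples spaced $T$ apart.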

The TDHS technique is capable of both pitch modification and time-scale modification.
If samples of $y_c(nCT)$ are output at the rate $1/(CT)$, then the modified residual has a scaled
bandwidth and the same duration as the original residual. If samples of $y_c(nCT)$ are output
at the rate $1/T$, then the modified residual has the same bandwidth as the original residual
but a modified time scale compared with the original residual. Note, however, that the pitch
and time-scale modifications are often approximate due to the approximations noted in the
above steps.


General  STFT-Based  Pitch  Modification  Techniques 

There  are  many  frequency  scaling  techniques  based  on  the  short-time  Fourier  transform 
[5,30-44].  These  techniques  calculate  the  STFT  of  a  signal,  modify  the  amplitudes  and 
phases of the STFT, and calculate the inverse of the modified STFT to form the modified
signal.  We  can  apply  these  techniques  to  the  linear  prediction  residual  as  in  Figure  6  and 
often  directly  to  the  speech  signal  itself  in  order  to  modify  the  pitch  of  the  speech  signal. 
These  techniques  often  generate  reverberation  and  other  artifacts  [45].  The  reverberation 
arises  from  improper  phase  modifications  to  the  excitation  spectrum,  while  some  of  the  other 
artifacts  arise  from  modifying  the  spectrum  of  the  unvoiced  portions  of  the  speech  signal. 
The  next  subsection  outlines  an  STFT-based  system  that  adjusts  the  phases  in  order  to 
greatly  reduce  the  degree  of  reverberation  in  the  output  signal. 

The Sinusoidal Analysis/Synthesis System of Quatieri and McAulay

The sinusoidal analysis/synthesis (SAS) system of Quatieri and McAulay is a versatile
system for effecting time-scale, frequency-scale, and pitch modifications on speech signals
[45].  Quatieri  and  McAulay  developed  the  system  in  a  series  of  papers  [45-51],  while  Serra 
and  Smith  independently  developed  a  similar  system  in  [52]. 

The  SAS  model  of  the  speech  signal  is 

$$s(t) = \sum_{j=1}^{N(t)} a_j(t)\, M(\omega_j(t), t) \cos\left[\omega_j(t)\,(t - t_{p_i}) + \psi(\omega_j(t), t)\right],$$

where

•  $N(t)$ is the number of frequency components at time $t$,

•  $\omega_j(t) = 2\pi f_j(t)$ is the $j$th angular-frequency component at time $t$,

•  $t_{p_i}$ is the most recent pitch onset time for time $t$,

•  $a_j(t)$ is the amplitude of the $j$th frequency component of the excitation at time $t$,

•  $M(\omega_j(t), t)$ is the magnitude of the vocal tract frequency response evaluated at the
frequency $\omega_j(t)$ for time $t$, and

•  $\psi(\omega_j(t), t)$ is the phase of the vocal tract response evaluated at the frequency $\omega_j(t)$ for
time $t$.

Note that the pitch onset time is the time at which all of the excitation sinusoids add
coherently prior to their modification by the vocal tract response. Also note that the pitch
onset times relate to each other through the time-varying pitch period, $T_0$, as

$$t_{p_i} = t_{p_{i-1}} + T_0.$$
Let

$$A(\omega_j(t), t) = a_j(t)\, M(\omega_j(t), t), \qquad (2)$$

$$\theta(\omega_j(t), t) = \omega_j(t)\,(t - t_{p_i}) + \psi(\omega_j(t), t); \qquad (3)$$

then $s(t)$ is more compactly written as

$$s(t) = \sum_{j=1}^{N(t)} A(\omega_j(t), t) \cos\left[\theta(\omega_j(t), t)\right].$$

The SAS system determines the various components in Equations 2 and 3 as follows. From
the original speech signal, the SAS system estimates the values of $A(\omega_j(t), t)$ and $\theta(\omega_j(t), t)$, as
well as the fundamental frequency and pitch onset times. For each $j$, the system computes
the various quantities in Equations 2 and 3 as

$$a_j(t) = 1, \qquad (4)$$

$$M(\omega_j(t), t) = A(\omega_j(t), t), \qquad (5)$$

$$\psi(\omega_j(t), t) = \theta(\omega_j(t), t) - \omega_j(t)\,(t - t_{p_i}). \qquad (6)$$

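To modify the pitch by a factor of $\beta$, the SAS system modifies the signal as follows:

$$s^M(t) = \sum_{j=1}^{\min(N(t),\, \lfloor N(t)/\beta \rfloor)} A^M(\beta\omega_j(t), t) \cos\left[\theta^M(\beta\omega_j(t), t)\right],$$

where

$$A^M(\beta\omega_j(t), t) = a_j(t)\, M(\beta\omega_j(t), t), \qquad (7)$$

$$\theta^M(\beta\omega_j(t), t) = \beta\omega_j(t)\,(t - t^M_{p_i}) + \psi(\beta\omega_j(t), t), \qquad (8)$$

and the $M$ superscript indicates a modified quantity. The modified pitch onset times are
related to each other through the modified time-varying pitch period, $T_0^M$, as

$$t^M_{p_i} = t^M_{p_{i-1}} + T_0^M.$$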

The pitch modification works as follows. The SAS system applies frequency resampling to
the vocal tract components. The system is based on the assumption that the vocal tract filter
is linear; therefore, the vocal tract filter outputs components of the same frequencies as those
input. After pitch modification, the frequencies that enter the vocal tract filter are different
from those that originally entered it. Because of this, $M(\omega, t)$ and $\psi(\omega, t)$ must be resampled
along the frequency axis at the points $\beta\omega_j(t)$ for $j = 1, \ldots, \min(N(t), \lfloor N(t)/\beta \rfloor)$. The SAS
system modifies the amplitudes of the excitation components by frequency scaling. The
amplitude of the $j$th harmonic remains the same after pitch modification, but the frequency
of that component is different after pitch modification. This is the reason that $a_j(t)$ is a
function of $j$ and not of $\omega_j(t)$. Finally, the SAS system modifies the phases of the excitation
components by aligning all of the components on a new set of pitch marks. This excitation
phase modification is expressed in the term $\beta\omega_j(t)\,(t - t^M_{p_i})$ of Equation 8.
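A minimal Python sketch of this modification for a single frame with harmonic frequencies is given below. It assumes that the vocal tract magnitude and phase are available as callables that can be evaluated at any frequency, that the number of retained harmonics is the smaller of $N$ and $\lfloor N/\beta \rfloor$, and that all names are hypothetical; it is not the published SAS implementation.

```python
import numpy as np

def sas_modified_frame(t, f0, n_harm, beta, amp, vt_mag, vt_phase, onset_mod):
    """Sum of pitch-modified sinusoids for one frame.

    amp[j-1]  : excitation amplitude a_j (kept per harmonic index, not per frequency)
    vt_mag    : callable M(omega); vt_phase : callable psi(omega)
    onset_mod : modified pitch onset time in seconds
    """
    n_keep = min(n_harm, int(np.floor(n_harm / beta)))   # keep harmonics inside the band
    s = np.zeros_like(t)
    for j in range(1, n_keep + 1):
        w = beta * 2.0 * np.pi * j * f0                  # scaled frequency beta * omega_j
        a_mod = amp[j - 1] * vt_mag(w)                   # vocal tract magnitude resampled at w
        theta_mod = w * (t - onset_mod) + vt_phase(w)    # phases aligned on the new onset
        s += a_mod * np.cos(theta_mod)
    return s

fs = 8000.0
t = np.arange(800) / fs
# Flat vocal tract (magnitude 1, phase 0) so that only the excitation is visible.
s_mod = sas_modified_frame(t, 100.0, 3, 1.5, [1.0, 0.2, 0.1],
                           lambda w: 1.0, lambda w: 0.0, 0.0)
spectrum = np.abs(np.fft.rfft(s_mod))
```

In this example the first harmonic keeps its excitation amplitude but moves from 100 Hz to 150 Hz, and the third harmonic is dropped because its scaled frequency would exceed the original band.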

Figure  7  shows  a  diagram  of  the  SAS  system  for  pitch  modification.  The  processing 
proceeds  as  follows. 

1.  Estimate the pitch, $F_0$, of the sampled input signal, $s(k)$, using the technique of [48];
here, $k$ denotes the time index of the sampled signal.

2.  The pitch period, $T_0 = 1/F_0$, determines the length of a data buffer and a window, $w$, to
be applied to the buffered data, $s_B(k)$. The window and buffer lengths are $2.5T_0$.

3.  Apply the window to the buffered data, and sample the windowed data vector every
10 msec. Quatieri and McAulay found that a frame interval of 10 msec produces
high-quality reconstruction.

4.  Calculate the 1024-point Fast Fourier Transform (FFT) of the windowed data vector.
Denote the FFT of the $m$th frame of data by $S(\omega, m)$. Here, $\omega$ denotes radian frequency.

5.  Determine the magnitude of $S(\omega, m)$ at the center frequency of each of the 1024 bins
comprising the FFT.

6.  Pick the peaks of the magnitude vector, and determine the frequencies that correspond
to these peaks. Denote the vector of peak magnitudes by $A(\omega, m)$ and the vector of
frequencies that correspond to the peaks of the magnitude vector by $f(m)$.

7.  Determine the phase of $S(\omega, m)$ at each of the frequencies given in $f(m)$; denote this
phase vector by $\theta(\omega, m)$.

8.  Determine $a_j(t)$, $M(\omega, t)$, and $\psi(\omega, t)$ using Equations 4, 5, and 6.

9.  Modify the phases and amplitudes in order to scale the pitch by a factor of $\beta$. This
involves using Equations 7 and 8. Denote the modified phase vector by $\theta^M(\omega, m)$ and
the modified amplitude vector by $A^M(\omega, m)$.

10.  Interpolate the amplitudes and phases on a frame-to-frame basis. This requires the
matching of the frequencies of each frame with those of the next frame and uses the
nearest-neighbor-based algorithm of [47].

11.  Generate unit-magnitude sinusoids using the frequencies and interpolated phases, then
multiply each sinusoid by the corresponding interpolated amplitude.

12.  Sum the sinusoids to form the pitch-modified signal, $s^M(k)$.
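The analysis half of this procedure (windowing, FFT, peak picking, and phase extraction for one frame) can be sketched in Python as follows. The Hamming window, the relative threshold used to ignore window sidelobes, the test signal, and the function name are assumptions made for illustration and are not taken from the cited papers.

```python
import numpy as np

def analyze_frame(frame, fs, n_fft=1024, rel_threshold=0.1):
    """Window one frame, take a zero-padded FFT, and pick spectral peaks."""
    windowed = frame * np.hamming(len(frame))
    spectrum = np.fft.rfft(windowed, n_fft)
    mag = np.abs(spectrum)
    # A bin is a peak if it exceeds both neighbours and a relative threshold
    # (the threshold is an assumption, used here to ignore window sidelobes).
    inner = mag[1:-1]
    is_peak = (inner > mag[:-2]) & (inner > mag[2:]) & (inner > rel_threshold * mag.max())
    idx = np.flatnonzero(is_peak) + 1
    freqs = idx * fs / n_fft
    return freqs, mag[idx], np.angle(spectrum[idx])

fs, f0 = 8000.0, 100.0
n = int(2.5 * fs / f0)                  # buffer length of 2.5 pitch periods (200 samples)
k = np.arange(n)
frame = np.cos(2 * np.pi * 500.0 * k / fs) + 0.8 * np.cos(2 * np.pi * 1000.0 * k / fs)
peak_freqs, peak_mags, peak_phases = analyze_frame(frame, fs)
```

The returned vectors play the roles of the peak frequencies, peak magnitudes, and peak phases for one frame; frame-to-frame matching and interpolation are not shown.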
Figure 7: Pitch Modification of Speech Using the Sinusoidal Analysis/Synthesis System of
Quatieri and McAulay (1992)
Quatieri and McAulay indicate in [45] that the SAS pitch modification system yields
speech  that  is  generally  free  of  artifacts,  although  a  certain  unnatural  quality  is  present  in 
the  unvoiced  sounds  due  to  the  scaling  of  the  unvoiced  frequency  components.  They  also 
indicate  that  some  hoarseness  is  present  for  pitch  modifications  much  greater  than  20%. 


The  Harmonic  Plus  Noise  Model  of  Laroche, 
Stylianou,  and  Moulines 


As  noted  in  the  subsection  concerning  the  multiband  excitation  vocoder,  modeling  voiced- 
speech  segments  as  purely  harmonic  signals  results  in  synthesized  speech  with  a  buzzy  quality. 
For  this  reason,  mixed-voicing  models  of  speech  such  as  the  multiband  excitation  vocoder 
have recently been proposed. This subsection outlines a mixed-voicing model called the
harmonic  plus  noise  model  (HNM)  [53],  which  is  attributed  to  Laroche,  Stylianou,  and 
Moulines. 

The HNM system is based on a time-varying sinusoidal model for the speech signal, $s(t)$.
The sinusoidal model is

$$s(t) = \sum_{n=-N(t_{p_i})}^{N(t_{p_i})} A_n(t) \exp\left(j n \omega_0(t_{p_i})\, t\right) + s_S(t),$$

where $N(t_{p_i})$ is the number of harmonics in the signal at pitch onset time $t_{p_i}$, $\omega_0(t_{p_i}) = 2\pi F_0$
is the radian frequency corresponding to the fundamental frequency at time $t_{p_i}$, $A_n(t)$ is the
complex time-varying amplitude of the $n$th harmonic, and $s_S(t)$ is a stochastic component.
The complex amplitudes are of the form

$$A_n(t) = a_n(t_{p_i}) + (t - t_{p_i})\, b_n(t_{p_i}),$$

where $a_n(t_{p_i})$ is the complex amplitude of the $n$th harmonic at $t_{p_i}$ and $b_n(t_{p_i})$ is the complex
slope of the amplitude of the $n$th harmonic at $t_{p_i}$. After some algebraic manipulation, the
sinusoidal model can be written in the following form:

$$\begin{aligned} s(t) = {} & \sum_{n=1}^{N(t)} 2\,|a_n(t_{p_i})| \cos\left(n\omega_0(t_{p_i})\, t + \theta_{a_n}(t_{p_i})\right) \\ & + (t - t_{p_i}) \sum_{n=1}^{N(t)} 2\,|b_n(t_{p_i})| \cos\left(n\omega_0(t_{p_i})\, t + \theta_{b_n}(t_{p_i})\right) \\ & + a_0(t_{p_i}) + (t - t_{p_i})\, b_0(t_{p_i}) + s_S(t), \end{aligned}$$
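where $\theta_{a_n}(t_{p_i})$ is the phase of $a_n(t_{p_i})$ and $\theta_{b_n}(t_{p_i})$ is the phase of $b_n(t_{p_i})$. Note that $a_0(t_{p_i})$
and $b_0(t_{p_i})$ are both real-valued. Let

$$\begin{aligned} \psi_{a_n}(t_{p_i}) &= \theta_{a_n}(t_{p_i}) + n\omega_0(t_{p_i})\, t_{p_i}, \\ \psi_{b_n}(t_{p_i}) &= \theta_{b_n}(t_{p_i}) + n\omega_0(t_{p_i})\, t_{p_i}, \\ B^c_{a_n}(t_{p_i}) &= 2\,|a_n(t_{p_i})| \cos\left(\psi_{a_n}(t_{p_i})\right), \\ B^c_{b_n}(t_{p_i}) &= 2\,|b_n(t_{p_i})| \cos\left(\psi_{b_n}(t_{p_i})\right), \\ B^s_{a_n}(t_{p_i}) &= -2\,|a_n(t_{p_i})| \sin\left(\psi_{a_n}(t_{p_i})\right), \\ B^s_{b_n}(t_{p_i}) &= -2\,|b_n(t_{p_i})| \sin\left(\psi_{b_n}(t_{p_i})\right); \end{aligned} \qquad (9)$$

then $s(t)$ can be written as

$$\begin{aligned} s(t) = {} & \sum_{n=1}^{N(t)} \left\{ B^c_{a_n}(t_{p_i}) + (t - t_{p_i})\, B^c_{b_n}(t_{p_i}) \right\} \cos\left(n\omega_0(t_{p_i})\,(t - t_{p_i})\right) \\ & + \sum_{n=1}^{N(t)} \left\{ B^s_{a_n}(t_{p_i}) + (t - t_{p_i})\, B^s_{b_n}(t_{p_i}) \right\} \sin\left(n\omega_0(t_{p_i})\,(t - t_{p_i})\right) \\ & + a_0(t_{p_i}) + (t - t_{p_i})\, b_0(t_{p_i}) + s_S(t). \end{aligned} \qquad (10)$$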

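The HNM system modifies the pitch in the following manner. Let

$$\begin{aligned} B^c_n(t) &= B^c_{a_n}(t_{p_i}) + (t - t_{p_i})\, B^c_{b_n}(t_{p_i}), \\ B^s_n(t) &= B^s_{a_n}(t_{p_i}) + (t - t_{p_i})\, B^s_{b_n}(t_{p_i}), \\ B_0(t) &= a_0(t_{p_i}) + (t - t_{p_i})\, b_0(t_{p_i}); \end{aligned}$$

then $s(t)$ can be written as

$$s(t) = \sum_{n=1}^{N(t)} \left\{ B^c_n(t) \cos\left(n\omega_0(t_{p_i})\,(t - t_{p_i})\right) + B^s_n(t) \sin\left(n\omega_0(t_{p_i})\,(t - t_{p_i})\right) \right\} + B_0(t) + s_S(t).$$

To modify the pitch by a factor of $\beta$, the HNM system forms the modified signal, $s^M(t)$, as

$$s^M(t) = \sum_{n=1}^{\min(N(t),\, \lfloor N(t)/\beta \rfloor)} \left\{ B^c_n(t) \cos\left(n\beta\omega_0(t_{p_i})\,(t - t^M_{p_i})\right) + B^s_n(t) \sin\left(n\beta\omega_0(t_{p_i})\,(t - t^M_{p_i})\right) \right\} + B_0(t) + s_S(t).$$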
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Figure  8  shows  a  diagram  of  a  pitch  modification  system  based  on  the  HNM.  The  system 
first  estimates  the  time- varying  pitch  of  the  speech  signal  and  the  pitch  onset  times  using,  for 
example,  the  algorithm  in  [54] .  At  every  pitch  onset  time,  the  system  estimates  the  parame¬ 
ters  of  a  sinusoidal  model  ( i.e .,  the  B°n  (tPi),  J9£  (tPi),  (tPi),  J5fn  (tPi),  a0  (tPi),  and  b0  (tPi) 
parameters  of  Equation  10)  of  the  speech  over  a  frame  of  length  2To,  where  To  is  the  pitch 
period.  The  HNM  system  estimates  the  parameter  vector,  x,  using  a  least-squares  technique. 
The  system  then  synthesizes  the  sampled  estimated  deterministic  component,  so(kTs),  of 
the  speech  signal  from  x  and  subtracts  S£>(kTs )  from  the  original  signal  to  yield  the  stochastic 
portion  of  the  speech  signal,  ss(kTs).  If  the  frame  is  voiced,  the  system  modifies  x  in  order 
to  achieve  the  desired  pitch  modification,  and  it  uses  the  modified  parameter  vector,  xM, 
to  synthesize  an  estimate  of  the  pitch-modified  deterministic  component,  s^(fcTs).  If  the 
frame  is  unvoiced,  then  the  system  does  not  modify  x.  Finally,  the  system  adds  ss(kTs)  and 
Sp(kTs)  to  form  the  total  pitch-modified  signal,  sM(kTs).  Note  that  the  s(kTs)  notation  is 
shortened  to  s(k )  in  the  figure. 
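
As an illustration of the synthesis step, the following minimal sketch evaluates the deterministic (harmonic) part of the model around one pitch onset time and applies a pitch factor by scaling the harmonic frequencies. It is not the authors' implementation: the function name, the argument layout, and the rule that harmonics are dropped once `n * beta` exceeds the number of available harmonics are assumptions made here for illustration, and the stochastic component is left out.

```python
import math

def synth_deterministic(harmonics, a0, b0, w0, t_pi, times, beta=1.0):
    """Evaluate the harmonic part of the model near the onset time t_pi.

    harmonics: list of (Bca, Bcb, Bsa, Bsb) tuples for n = 1..N (hypothetical layout).
    w0: fundamental frequency in rad/s; beta: pitch-modification factor.
    """
    n_avail = len(harmonics)
    # Assumption: keep harmonic n only while n*beta stays within the original band.
    n_used = min(n_avail, math.floor(n_avail / beta))
    out = []
    for t in times:
        dt = t - t_pi
        value = a0 + dt * b0                      # offset term B_0(t)
        for n in range(1, n_used + 1):
            bca, bcb, bsa, bsb = harmonics[n - 1]
            bc = bca + dt * bcb                   # cosine amplitude, linear in time
            bs = bsa + dt * bsb                   # sine amplitude, linear in time
            phase = n * beta * w0 * dt            # harmonic frequency scaled by beta
            value += bc * math.cos(phase) + bs * math.sin(phase)
        out.append(value)
    return out
```

With `beta=1.0` the function reproduces the unmodified deterministic component; the total pitch-modified signal would additionally require the unmodified stochastic portion to be added back.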

It is instructive to compare the HNM system and the SAS. Both systems use sinusoidal
models with time-varying amplitudes. The HNM system uses a time-domain least-squares
technique to estimate the model parameters, while the SAS uses a frequency-domain technique
based on the STFT. The SAS modifies both the stochastic and deterministic portions
of the signal, while the HNM system modifies only the deterministic portion of the signal.
Quatieri and McAulay have indicated that modifying the stochastic portion in the course of
modifying the pitch often leads to artifacts in the pitch-modified speech.

One of the major drawbacks of the HNM system is its large computational burden.
Using a least-squares technique, the system estimates four parameters for each harmonic (see
Equation 10) plus the two offset parameters ($a_0(t_{p_i})$ and $b_0(t_{p_i})$). Least-squares
problems are most accurately solved using the singular value decomposition (SVD); however, the
SVD is quite computationally burdensome for large numbers of parameters. Given a male speaker
with an average pitch of 100 Hz and a bandwidth (BW) of 8 kHz, the HNM system estimates
on the order of 318 $\left(= 4\lfloor \mathrm{BW}/\text{pitch} \rfloor - 2\right)$ parameters for
each pitch period. A female speaker with an average pitch of 200 Hz and a bandwidth of 8 kHz
requires approximately half the number of parameters per pitch period compared to the male
speaker. However, the female speaker has twice the number of pitch periods for a given time
period compared to the male speaker, so the computational burden is still high for the female
speaker. Of course, lowering the speech bandwidth lowers the number of harmonics present in the
speech, which in turn lowers the computational burden of the HNM system for both male and female
speakers; however, lowering the bandwidth also lowers the quality of the speech.
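
The arithmetic behind this comparison can be checked with a short sketch. It takes the parameter count per pitch period to be 4 * floor(BW / pitch) - 2, which reproduces the figure of 318 quoted above; the function name and the per-second tally are illustrative assumptions, not part of the HNM system.

```python
import math

def hnm_parameter_count(pitch_hz, bandwidth_hz):
    """Parameters estimated per pitch period, using 4*floor(BW/pitch) - 2."""
    return 4 * math.floor(bandwidth_hz / pitch_hz) - 2

# Male speaker: 100 Hz pitch, 8 kHz bandwidth.
male_per_period = hnm_parameter_count(100, 8000)
# Female speaker: 200 Hz pitch, 8 kHz bandwidth.
female_per_period = hnm_parameter_count(200, 8000)

# Pitch periods per second equal the pitch in Hz, so the number of
# parameters estimated per second of voiced speech is nearly the same.
male_per_second = 100 * male_per_period
female_per_second = 200 * female_per_period
```

Under these assumptions the female speaker needs about half as many parameters per pitch period, but roughly the same number per second of voiced speech. The cost of each least-squares solve still depends on the number of parameters per frame, so the per-second count is only a rough proxy for the computational burden.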


The Pitch-Synchronous Overlap-Add Method


This subsection outlines the pitch-synchronous overlap-add (PSOLA) method of pitch
modification [4, 55-57]. The PSOLA pitch modification technique consists of windowing
segments of the original speech signal and placing these windowed segments in a (modified)
output signal. In general, the speech segments have a different degree of overlap in the
modified signal than they have in the original signal, and it is this change in the degree of
overlap that effects the change in pitch.

The PSOLA technique builds the output signal from several windowed segments of the
input signal as follows. Assume that the output signal initially consists of a zero-valued
sequence; the PSOLA process then adds windowed segments of the original signal to the
output one segment at a time. In performing the overlap-add (OLA) process on voiced-speech
segments, the PSOLA technique makes extensive use of pitch onset times. The
PSOLA technique windows the speech signal about the pitch onset times for the original
signal and uses the OLA process to place the windowed segments about the pitch onset times
for the output signal. The algorithm described in [54] is one popular way to estimate the
pitch onset times for the original speech signal. For the output signal, one places the pitch
onset times to effect the desired pitch modification.

Note that in order to keep the length of the output signal approximately equal to the
length of the original signal, one may need different numbers of pitch onset times for the
original and output speech signals. Figure 9 gives two examples of applying the PSOLA
technique that yield different numbers of pitch onset times. Assume that the original signal
has a voiced-speech segment with four pitch onset times (denoted by the small vertical arrows
in Figure 9). Figure 9(a) illustrates a case where a windowed segment from the original signal
is not used in the output. Figure 9(b) illustrates a case where windowed segments from the
original signal are repeated in the output.
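
The overlap-add procedure described above can be sketched as follows for a single voiced segment. This is a minimal illustration under stated assumptions, not the method of [4]: the pitch onset indices are assumed to be known, each output onset takes its segment from the nearest input onset, the window is a Hann window spanning two local pitch periods, and no gain normalization is applied. The function name and its return value are hypothetical.

```python
import math

def psola(signal, onsets, beta):
    """Pitch-modify a voiced segment by the factor beta, keeping its length.

    signal: list of samples; onsets: increasing sample indices of pitch onsets
    (assumed already estimated); beta > 1 raises the pitch.
    Returns the output samples and the input-onset index used at each output onset.
    """
    out = [0.0] * len(signal)
    # Local pitch period at each input onset (the last onset reuses the previous one).
    periods = [onsets[i + 1] - onsets[i] for i in range(len(onsets) - 1)]
    periods.append(periods[-1])
    pos = float(onsets[0])            # first output onset coincides with the input
    used = []
    while pos <= onsets[-1]:
        # The nearest input onset supplies the windowed segment.
        i = min(range(len(onsets)), key=lambda j: abs(onsets[j] - pos))
        used.append(i)
        period = periods[i]
        center_out = int(round(pos))
        # Hann window spanning two pitch periods, centered on the onset.
        for m in range(-period, period + 1):
            src, dst = onsets[i] + m, center_out + m
            if 0 <= src < len(signal) and 0 <= dst < len(out):
                w = 0.5 * (1.0 + math.cos(math.pi * m / period))
                out[dst] += w * signal[src]
        pos += period / beta          # output onsets are spaced by the new period
    return out, used
```

With `beta < 1` some input segments never appear in `used`, as in Figure 9(a); with `beta > 1` some indices appear more than once, as in Figure 9(b).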

Moulines and Charpentier indicate in [4] that the PSOLA technique can introduce artifacts
in the pitch-modified speech. These artifacts include tonal noises, hoarseness, and
reverberation. It has been the author's experience that the PSOLA technique provides
excellent-quality pitch modification for pitch changes of less than approximately 25%.

Figure 9: Conceptual Diagrams of the Pitch-Synchronous Overlap-Add Method for (a)
Decreasing the Pitch and (b) Increasing the Pitch

MAJOR TIME-SCALE MODIFICATION TECHNIQUES


This section very briefly discusses selected time-scale modification techniques. In particular,
this section outlines basic cut-and-splice methods, the synchronized overlap-add method,
and  wavelet-based  time-scale  modification  methods.  Note  that  several  methods  discussed  in 
the  previous  section  in  connection  with  pitch  modification  can  also  be  used  for  time-scale 
modification.  This  section  does  not  review  those  techniques;  instead,  the  reader  is  referred 
to  the  discussions  in  the  previous  section  and  in  the  papers. 


Cut-and-Splice  Methods 


The general category of cut-and-splice methods encompasses several techniques; however,
the basic ideas are as follows [58-62]. To reduce the duration of a speech signal to $\alpha$ times
the original duration (with $0 < \alpha < 1$), remove segments of size $(1 - \alpha)T$ from the
signal every $T$ seconds. To expand the duration of a speech signal to $(1 + \alpha)$ times the
original duration, repeat segments of size $\alpha T$ every $T$ seconds. While this method is
conceptually and computationally simple, it results in poor-quality speech due to glitches at
the segment boundaries and to the interruption of the local regularity of the speech signal [61].
In an effort to reduce these effects, Neuburg [61] proposed a system in which segments of the
length of the local pitch period are discarded or repeated. Jianping [59] proposed a similar
technique designed to reduce the possibility of discarding or repeating phoneme transitions.
As noted in [63], these techniques often introduce a "burbling" distortion in the output speech.
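
The two basic operations can be sketched on a sampled signal as follows. This is a minimal illustration that assumes the interval T is given as a whole number of samples (`period`) and that the discarded or repeated portion is taken from the end of each block; both choices, and the function names, are assumptions made here.

```python
def cut_and_splice_compress(signal, alpha, period):
    """Shorten to about alpha times the original length.

    Every `period` samples, keep the first alpha*period samples and
    discard the remaining (1 - alpha)*period samples.
    """
    keep = int(round(alpha * period))
    out = []
    for start in range(0, len(signal), period):
        out.extend(signal[start:start + keep])
    return out

def cut_and_splice_expand(signal, alpha, period):
    """Lengthen to about (1 + alpha) times the original length.

    Every `period` samples, repeat the last alpha*period samples of the block.
    """
    rep = int(round(alpha * period))
    out = []
    for start in range(0, len(signal), period):
        block = signal[start:start + period]
        out.extend(block)
        # Repeat the tail of the block; max() guards short final blocks and rep == 0.
        out.extend(block[max(0, len(block) - rep):])
    return out
```

The abrupt joins between kept and repeated blocks are exactly where the boundary glitches mentioned above arise, since nothing aligns the waveform across a splice.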


The  Synchronized  Overlap-Add  Method 


A basic OLA method for time-scale modification consists of two steps. First, window the
data every $S_a$ points (i.e., space the windows so that the distance between the beginning of
a window and the beginning of the previous window is $S_a$ points). Second, overlap and add
the windowed segments such that the new spacing between the windowed segments is $S_s$
points. Choose $S_s < S_a$ to compress the time scale and $S_s > S_a$ to expand the time scale.

The basic OLA method often leads to artifacts in the output speech. In an effort to reduce
these artifacts, several researchers have investigated the use of the synchronized overlap-add
(SOLA) method [64-69]. The SOLA technique differs from the basic OLA in that the spacing
of the windows for the output speech is not $S_s$ but $S_s + k(n)$, where $k(n)$ is a time-varying
number of data points chosen to synchronize successive windowed data segments. For each
segment number, $n$, choose $k(n)$ so that the data segments add coherently; do this by finding
the $k(n)$ that maximizes the cross-correlation between adjacent windowed data segments.
This process reduces many of the glitches found in the output speech of the basic OLA
method.
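
A minimal sketch of the synchronization search follows. It is one common way to arrange the computation rather than the specific algorithm of [64-69]: segment n is placed at output position n * S_s + k(n), the offset k(n) is found by an exhaustive search over a hypothetical range of plus or minus `k_max` samples using the normalized cross-correlation over the overlap region, and a linear cross-fade stands in for explicit analysis and synthesis windows.

```python
import math

def sola(signal, win_len, s_a, s_s, k_max):
    """Time-scale `signal` by roughly s_s / s_a with synchronized overlap-add."""
    out = list(signal[:win_len])              # the first segment is copied unchanged
    n = 1
    while n * s_a + win_len <= len(signal):
        seg = signal[n * s_a:n * s_a + win_len]
        best = None                           # (score, start, overlap)
        for k in range(-k_max, k_max + 1):
            start = n * s_s + k               # candidate output position
            overlap = len(out) - start
            if start < 0 or not 0 < overlap <= win_len:
                continue                      # candidate does not overlap the output
            a = out[start:]
            b = seg[:overlap]
            num = sum(x * y for x, y in zip(a, b))
            den = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
            score = num / den if den > 0.0 else 0.0
            if best is None or score > best[0]:
                best = (score, start, overlap)
        if best is None:
            break                             # no valid placement for this segment
        _, start, overlap = best
        # Cross-fade the overlap region, then append the rest of the segment.
        for j in range(overlap):
            fade = j / overlap
            out[start + j] = (1.0 - fade) * out[start + j] + fade * seg[j]
        out.extend(seg[overlap:])
        n += 1
    return out
```

For a periodic input whose period fits inside the search range, the search finds an offset at which the overlapping segments are in phase, so the cross-fade adds them coherently instead of producing a discontinuity.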


Wavelet-Based  Methods 

Time-scale modification techniques based on wavelet analysis resemble time-scale modification
techniques based on the STFT. In both cases, one transforms the speech signal,
modifies the parameters of the transformed signal, and computes the inverse transform to
form the output signal. Let $S(p, \tau)$ be the wavelet transform of a signal, $s(t)$; then

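$$
S(p, \tau) = \frac{1}{\sqrt{p}} \int \overline{g}\!\left( \frac{t - \tau}{p} \right) s(t)\, dt,
$$

where $g(\cdot)$ is the analyzing wavelet chosen by the user. The wavelet transform of $s(\alpha t)$ is
related to the wavelet transform of $s(t)$ as follows:

$$
s(\alpha t) \longleftrightarrow \frac{1}{\sqrt{\alpha}}\, S(\alpha p, \alpha \tau).
$$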
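
The scaling relation can be checked with a change of variables. The derivation below is a sketch that assumes the common normalization $S(p, \tau) = p^{-1/2} \int \overline{g}((t - \tau)/p)\, s(t)\, dt$ with $p > 0$ and $\alpha > 0$; with a different normalization the prefactor changes accordingly.

```latex
\begin{aligned}
\frac{1}{\sqrt{p}} \int \overline{g}\!\left( \frac{t - \tau}{p} \right) s(\alpha t)\, dt
  &= \frac{1}{\alpha \sqrt{p}} \int \overline{g}\!\left( \frac{u - \alpha \tau}{\alpha p} \right) s(u)\, du
     \qquad (u = \alpha t) \\
  &= \frac{1}{\sqrt{\alpha}} \cdot \frac{1}{\sqrt{\alpha p}}
     \int \overline{g}\!\left( \frac{u - \alpha \tau}{\alpha p} \right) s(u)\, du \\
  &= \frac{1}{\sqrt{\alpha}}\, S(\alpha p, \alpha \tau).
\end{aligned}
```

In other words, compressing or expanding the time axis of the signal rescales both the scale and the translation arguments of its wavelet transform by the same factor, which is what makes the transform domain a natural place to apply time-scale modification.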

Wavelet time-scale modification techniques are discussed in [70-75].

RECOMMENDATIONS AND CONCLUSIONS


This report has summarized the literature in the areas of pitch modification and time-scale
modification of speech. Based on the claims made in the literature, this report recommends
four techniques for further consideration: the pitch-synchronous overlap-add technique;
the sinusoidal analysis/synthesis system of Quatieri and McAulay; the harmonic plus
noise model of Laroche, Stylianou, and Moulines; and the time-domain harmonic scaling
technique. All four techniques are capable of modifying both the pitch and the time scale
of speech signals. When used as pitch modification techniques, none of the four recommended
techniques requires a separate time-scale modification technique. However, note that
the time-domain harmonic scaling technique requires that the output signal be interpolated
or decimated so that the data points are output at the same rate as the original speech
signal. The pitch-synchronous overlap-add and time-domain harmonic scaling techniques
have the advantage of requiring considerably less computation than do the sinusoidal
analysis/synthesis system of Quatieri and McAulay and the harmonic plus noise model of Laroche,
Stylianou, and Moulines. Further research is needed to determine the relative performance of
the recommended techniques.